WEEK 4: DATA JOINS AND TRANSFORMATIONS

Monday, January 30th

Today we will…

Lab 3: Familiar Words – Sketch it out!

For each demographic group listed below, determine which word(s) in this study was(were) the most and least familiar on average.

Data Layouts

Tidy Data

Image source: R4DS

Untidy “Messy” Data

Illustration by Allison Horst

Our tools!

Illustration by Allison Horst

Collaboration with tidy data.

Illustration by Allison Horst

Pivoting Data

Tidyexpalin animation by Kelsey Gonzalez

Manual Method

We can do the wide-to-long transition manually.

Consider the table of daily rainfall observed in SLO in January 2023. This data is recorded in human-friendly form, in the approximate shape of a calendar. Each week has its own row, and each day has its own column.

Talk to a neighbor about how you would convert this to long format. You may want to open up the spreadsheet containing this table on your computer.

Data source

Manual Method: Steps

  1. Create a new column: Day_of_Week.

  2. Create a Rainfall column to hold the daily rainfall values.

  3. Now we have three columns setup (Week, Day_of_Week, and Rainfall) – start moving data over.

  4. Duplicate repeated data (Week 1-5) and copy Monday over.

  5. Duplicate repeated data (Week 1-5) and copy Tuesday over.

  6. Continue for the rest of the days of the week.

  7. You may want to arrange() the data by week to get chronological order of the rainfall values.

Computational Approach

pivot_longer()

library(readxl)
slo_rainfall <- read_xlsx("2023-rainfall-slo.xlsx")

slo_rainfall_long <- slo_rainfall |> 
  mutate(across(Sunday:Saturday, na_if, y = "NA"),
         across(Sunday:Saturday, as.numeric)
         ) |> 
  pivot_longer(cols      = Sunday:Saturday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
slo_rainfall_long
# A tibble: 35 × 3
    Week Day_of_Week Daily_Rainfall
   <dbl> <chr>                <dbl>
 1     1 Sunday                0   
 2     1 Monday                0.12
 3     1 Tuesday               0   
 4     1 Wednesday             1.58
 5     1 Thursday              0.91
 6     1 Friday                0   
 7     1 Saturday              0.05
 8     2 Sunday                0.27
 9     2 Monday                4.26
10     2 Tuesday               0.43
# … with 25 more rows

Why tidy data?

library(liver)
data(cereal)
str(cereal, hide.attr = TRUE)
'data.frame':   77 obs. of  16 variables:
 $ name    : Factor w/ 77 levels "100% Bran","100% Natural Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ manuf   : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
 $ type    : Factor w/ 2 levels "cold","hot": 1 1 1 1 1 1 1 1 1 1 ...
 $ calories: int  70 120 70 50 110 110 110 130 90 90 ...
 $ protein : int  4 3 4 4 2 2 2 3 2 3 ...
 $ fat     : int  1 5 1 0 2 2 0 2 1 0 ...
 $ sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
 $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
 $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
 $ sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
 $ potass  : int  280 135 320 330 -1 70 30 100 125 190 ...
 $ vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
 $ shelf   : int  3 3 3 3 3 1 2 3 1 3 ...
 $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
 $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
 $ rating  : num  68.4 34 59.4 93.7 34.4 ...
cereal_summary1 <- cereal |> 
  group_by(shelf) |> 
  summarise(across(calories:vitamins, mean))
cereal_summary1
# A tibble: 3 × 10
  shelf calories protein   fat sodium fiber carbo sugars potass vitamins
  <int>    <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl>
1     1     102.    2.65  0.6    176. 1.68   15.8   4.8    75.5     20  
2     2     110.    1.90  1      146. 0.905  13.6   9.62   57.8     23.8
3     3     108.    2.86  1.25   159. 3.14   14.5   6.53  130.      35.4
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")

cereal_summary1 |> 
  ggplot() +
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) + 
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

cereal_summary2<- cereal |> 
  pivot_longer(cols = calories:vitamins,
               names_to = "Nutrient",
               values_to = "Amount") |> 
  group_by(shelf, Nutrient) |> 
  summarise(mean_amount = mean(Amount))
cereal_summary2
# A tibble: 27 × 3
# Groups:   shelf [3]
   shelf Nutrient mean_amount
   <int> <chr>          <dbl>
 1     1 calories      102.  
 2     1 carbo          15.8 
 3     1 fat             0.6 
 4     1 fiber           1.68
 5     1 potass         75.5 
 6     1 protein         2.65
 7     1 sodium        176.  
 8     1 sugars          4.8 
 9     1 vitamins       20   
10     2 calories      110.  
# … with 17 more rows
cereal_summary2 |> 
  ggplot(aes(x = shelf, 
             y = mean_amount, 
             color = Nutrient)) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount")

pivot_wider()

mean_protein <- cereal |> 
  group_by(manuf, shelf) |> 
  summarize(mean_protein = mean(protein))
mean_protein
# A tibble: 18 × 3
# Groups:   manuf [7]
   manuf shelf mean_protein
   <fct> <int>        <dbl>
 1 A         2         4   
 2 G         1         3   
 3 G         2         1.29
 4 G         3         2.67
 5 K         1         2.75
 6 K         2         2.14
 7 K         3         2.92
 8 N         1         2.67
 9 N         2         2.5 
10 N         3         4   
11 P         1         1.5 
12 P         2         1   
13 P         3         3   
14 Q         1         5   
15 Q         2         2   
16 Q         3         2.5 
17 R         1         2   
18 R         3         3   
protein_wide <- mean_protein |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)
protein_wide
# A tibble: 7 × 4
# Groups:   manuf [7]
  manuf   `1`   `2`   `3`
  <fct> <dbl> <dbl> <dbl>
1 G      3     1.29  2.67
2 K      2.75  2.14  2.92
3 N      2.67  2.5   4   
4 P      1.5   1     3   
5 Q      5     2     2.5 
6 R      2    NA     3   
7 A     NA     4    NA   

Better names in pivot_wider()

protein_wide <- mean_protein |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf ")
protein_wide
# A tibble: 7 × 4
# Groups:   manuf [7]
  manuf `Shelf 1` `Shelf 2` `Shelf 3`
  <fct>     <dbl>     <dbl>     <dbl>
1 G          3         1.29      2.67
2 K          2.75      2.14      2.92
3 N          2.67      2.5       4   
4 P          1.5       1         3   
5 Q          5         2         2.5 
6 R          2        NA         3   
7 A         NA         4        NA   

Data Joins

Relational Data

Multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important.

IMDb movie relational data

Data Joins

Mutating joins

Adds information from a new dataframe to observations in an existing dataframe

full_join(), left_join(), right_join(), inner_join(), outer_join()

Filtering Joins

Filters observations based on values in new dataframe

semi_join(), anti_join()

Keys

Uniquely identifies an observation in a data set

Relate data sets to each other

inner_join()

Matches pairs of observations when “keys” are equal

Inner Join: IMDb Example

directors_genres
# A tibble: 10 × 3
# Groups:   director_id [5]
   director_id genre      prob
         <dbl> <chr>     <dbl>
 1         429 Adventure 0.75 
 2         429 Fantasy   0.75 
 3        2931 Drama     0.714
 4        2931 Action    0.429
 5       11652 Sci-Fi    0.5  
 6       11652 Action    0.5  
 7       14927 Animation 1    
 8       14927 Family    1    
 9       15092 Comedy    0.545
10       15092 Crime     0.545

Directors: 429, 2931, 9247, 11652, 14927, 15092

movies_directors
# A tibble: 7 × 2
  director_id movie_id
        <dbl>    <dbl>
1         429   300229
2        9247   124110
3       11652    10920
4       11652   333856
5       14927   192017
6       15092   109093
7       15092   237431

Directors: 429, 2931, 9247, 11652, 14927, 15092

inner_join(directors_genres, movies_directors)
# A tibble: 12 × 4
# Groups:   director_id [4]
   director_id genre      prob movie_id
         <dbl> <chr>     <dbl>    <dbl>
 1         429 Adventure 0.75    300229
 2         429 Fantasy   0.75    300229
 3       11652 Sci-Fi    0.5      10920
 4       11652 Sci-Fi    0.5     333856
 5       11652 Action    0.5      10920
 6       11652 Action    0.5     333856
 7       14927 Animation 1       192017
 8       14927 Family    1       192017
 9       15092 Comedy    0.545   109093
10       15092 Comedy    0.545   237431
11       15092 Crime     0.545   109093
12       15092 Crime     0.545   237431

Directors: 429, 2931, 9247, 11652, 14927, 15092

Inner Join: IMDb Example

What if our key variable is not named the same?

directors_genres
# A tibble: 10 × 3
# Groups:   director_id [5]
   director_id genre      prob
         <dbl> <chr>     <dbl>
 1         429 Adventure 0.75 
 2         429 Fantasy   0.75 
 3        2931 Drama     0.714
 4        2931 Action    0.429
 5       11652 Sci-Fi    0.5  
 6       11652 Action    0.5  
 7       14927 Animation 1    
 8       14927 Family    1    
 9       15092 Comedy    0.545
10       15092 Crime     0.545
directors
# A tibble: 5 × 3
     id first_name last_name
  <dbl> <chr>      <chr>    
1   429 Andrew     Adamson  
2  9247 Zach       Braff    
3 11652 James (I)  Cameron  
4 14927 Ron        Clements 
5 15092 Ethan      Coen     
inner_join(directors_genres, 
           directors, 
           by = c("director_id" = "id")
           )
# A tibble: 8 × 5
     id first_name last_name genre      prob
  <dbl> <chr>      <chr>     <chr>     <dbl>
1   429 Andrew     Adamson   Adventure 0.75 
2   429 Andrew     Adamson   Fantasy   0.75 
3 11652 James (I)  Cameron   Sci-Fi    0.5  
4 11652 James (I)  Cameron   Action    0.5  
5 14927 Ron        Clements  Animation 1    
6 14927 Ron        Clements  Family    1    
7 15092 Ethan      Coen      Comedy    0.545
8 15092 Ethan      Coen      Crime     0.545

Mutating Joins

  • left_join() Everything is kept in the data set on the left

  • right_join() Everything is kept in the data set on the right

  • full_join() Everything is kept in both data sets

Mutating Joins

Discuss with a neighbor.

Which of the following directors would be kept in for each of:

  • left_join(directors_genres, movies_directors)
  • right_join(directors_genres, movies_directors)
  • full_join(directors_genres, movies_directors)
directors_genres |> 
  distinct(director_id)
# A tibble: 5 × 1
# Groups:   director_id [5]
  director_id
        <dbl>
1         429
2        2931
3       11652
4       14927
5       15092
movies_directors |> 
  distinct(director_id)
# A tibble: 5 × 1
  director_id
        <dbl>
1         429
2        9247
3       11652
4       14927
5       15092

Filtering Joins: semi_join()

semi_join() Keeping observations

semi_join(directors_genres, movies_directors)
# A tibble: 8 × 3
# Groups:   director_id [4]
  director_id genre      prob
        <dbl> <chr>     <dbl>
1         429 Adventure 0.75 
2         429 Fantasy   0.75 
3       11652 Sci-Fi    0.5  
4       11652 Action    0.5  
5       14927 Animation 1    
6       14927 Family    1    
7       15092 Comedy    0.545
8       15092 Crime     0.545

Movie Directors: 429, 2931, 11652, 14927, 15092

Including observations with %in%

directors_genres |>
  filter(director_id %in% c(429, 9247, 11652, 14927, 15092))
# A tibble: 8 × 3
# Groups:   director_id [4]
  director_id genre      prob
        <dbl> <chr>     <dbl>
1         429 Adventure 0.75 
2         429 Fantasy   0.75 
3       11652 Sci-Fi    0.5  
4       11652 Action    0.5  
5       14927 Animation 1    
6       14927 Family    1    
7       15092 Comedy    0.545
8       15092 Crime     0.545

similar to semi_join()!

Filtering Joins: anti_join()

anti_join() Removing Observations

anti_join(directors_genres, movies_directors)
# A tibble: 2 × 3
# Groups:   director_id [1]
  director_id genre   prob
        <dbl> <chr>  <dbl>
1        2931 Drama  0.714
2        2931 Action 0.429

Movie Directors: 429, 2931, 11652, 14927, 15092

Excluding observations with !%in%

directors_genres |>
  filter(!director_id %in% c(429, 9247, 11652, 14927, 15092))
# A tibble: 2 × 3
# Groups:   director_id [1]
  director_id genre   prob
        <dbl> <chr>  <dbl>
1        2931 Drama  0.714
2        2931 Action 0.429

similar to anti_join()!

A note about piping joins

inner_join(directors_genres, movies_directors)

directors_genres |> 
  inner_join(movies_directors)

PA 4: Military Spending

Today you will be tidying untidy data to explore the relationship between countries of the world and military spending.

Due Wednesday, 2/1 at 8:00am

Bonus Challenge: Murder Mystery in SQL City

For this challenge, you will be using table joins to solve a murder mystery.

Due Sunday, 2/12 at 11:59pm

To do…

  • PA 4: Military Spending
    • Due Wednesday 2/1 at 8:00am
  • Bonus Challenge: Murder Mystery in SQL City
    • Due Sunday 2/12 at 11:59pm

Wednesday, January 25th

Today we will…

  • Review PA 4: Military Spending
  • Practice with Relational Data
  • Lab 4: Avocado Prices
  • Challenge 3: Avocado Toast Ate My Mortgage

Lab + Challenge

Lab 4: Avocado Prices + Challenge 4: Avocado Toast Ate My Mortgage


Handy Helpers

rename() – Change names of columns

separate() – Separate values of a variable


Filtering Joins

semi_join(): Keeps values found in another data set

anti_join(): Keeps values not found in another data set

To do…

  • Lab 4: Avocado Prices
    • Due Friday, 2/3 at 11:59pm
  • Challenge 4: Avocado Toast Ate My Mortgage
    • Due Saturday, 2/4 at 11:59pm
  • Read Chapter 5: Special Data Types
    • Concept Check 5.1 + 5.2 + 5.3 due Monday (2/6) at 8:00am